A Finite State Model for Urdu Nastalique Optical Character Recognition
Authors
Abstract
Finite state technology has long been used to model NLP (Natural Language Processing) applications; in particular, it has been applied very successfully to machine translation and speech recognition systems. Character recognition in cursive scripts and in handwritten Latin script has also attracted researchers' attention, and some research has been done in this area. Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. OCRs developed for many of the world's languages are already in efficient use, but none exists for Nastalique, a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. Urdu has 39 characters against Arabic's 28. Each character has two to four different shapes according to its position in the word: initial, medial, final and isolated. In Nastalique, word and character overlapping makes optical recognition more complex, whereas optical character recognition of the Latin script is relatively easier. This paper, based on research on Nastalique OCR, discusses a proposed finite state model for the optical recognition of printed Nastalique text.
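The dependence of a character's shape on its position in the word can be illustrated with a very small state machine. The sketch below is not the finite state model proposed in the paper, which the abstract does not describe; it only labels each letter of a word as isolated, initial, medial or final. The function name `positional_forms` and the letter set `RIGHT_JOINING` are hypothetical, and the set is a simplified assumption: letters in it connect only to the preceding letter, and every other letter is treated as connecting on both sides (which ignores non-joining characters such as the hamza).

```python
# Letters that connect only to the preceding letter, never to the following one.
# Assumption: a simplified list; all other letters are treated as joining on both sides.
RIGHT_JOINING = set("اآدڈذرڑزژوے")


def positional_forms(word):
    """Label each letter of `word` as isolated, initial, medial or final.

    The machine has two states, recording whether the previous letter
    offers a connection to the current one. The output for each letter
    depends on that state, on the letter itself, and on whether another
    letter follows it in the word.
    """
    forms = []
    joined_from_prev = False  # start state: nothing to connect to
    for i, ch in enumerate(word):
        has_next = i + 1 < len(word)
        joins_next = has_next and ch not in RIGHT_JOINING
        if joined_from_prev and joins_next:
            form = "medial"
        elif joined_from_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((ch, form))
        joined_from_prev = joins_next  # state transition
    return forms


if __name__ == "__main__":
    for letter, form in positional_forms("کتاب"):
        print(letter, form)
```

A recognizer would work in the opposite direction, from observed shapes to characters, but the same positional constraints limit which shape sequences are valid within a ligature.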
Similar resources
Segmentation Based Urdu Nastalique OCR
Urdu is written in the Nastalique writing style, which is highly cursive and context sensitive, and is difficult to process because only the last character in a ligature rests on the baseline. This paper focuses on the development of an OCR using a Hidden Markov Model and a rule-based post-processor. The recognizer gets the main body (without diacritics) as input and recognizes the corresponding ligat...
Segmentation of Nastaliq Script for OCR
In this paper we present a novel segmentation technique for implementing an OCR (Optical Character Recognition) system for printed Nastalique text, a calligraphic style of Urdu, which is written in the Arabic script. OCRs for many of the world's major languages have been developed and are in use, but at present an OCR for Nastalique is not available and the published research on ...
Diacritics Recognition Based Urdu Nastalique OCR System
Improvements and new developments in the field of Artificial Intelligence have opened new horizons in the advancement of machines that originally had limited intelligence. Compared to the human brain, machines already have better computational speed and storage; however, there is still much room to improve their capability to acquire and process data and draw conclusions from it on their own. Optical...
Recognition of Urdu Character with Hmm Technique
This paper deals with an Optical Character Recognition system for printed Urdu, a popular Pakistani/Indian script and the third most widely understood language in the world, especially in the subcontinent, yet few efforts have been made to make it understandable to computers. A lot of work has been done in literature and Islamic studies in Urdu, which has to be computerized. Research h...
Optical Character Recognition System for Urdu Words in Nastaliq Font
Optical Character Recognition (OCR) has been an attractive research area for the last three decades, and mature OCR systems reporting recognition rates close to 100% are available for many scripts and languages today. Despite these developments, research on recognition of text in many languages is still in its early days, Urdu being one of them. The limited existing literature on Urdu OCR is either l...
Publication date: 2009